The purpose of this analysis is to gain a deeper understanding of the Leaf bracelet , a product from the wearable health and wellness company Bellabeat. By analyzing the usage data of the FITBIT smart device, I aim to learn more about how people are currently using their devices. Specifically, I want to identify trends and patterns in the data that will help us improve our understanding of consumer usage. To achieve this goal, I will need to analyze data on customer demographics, product usage, and sales data. We will also need to perform statistical analyses to identify significant trends and patterns in the data. Based on my findings, I will make recommendations for how to optimize our marketing efforts and improve product performance.
The data source for this analysis is a public dataset called the FitBit Fitness Tracker Data, which was made available through Mobius on Kaggle under the CC0: Public Domain license. The data set contains personal fitness tracker information from 30 Fitbit users, who participated in a distributed survey conducted via Amazon Mechanical Turk between 03.12.2016-05.12.2016. The data includes minute-level output for physical activity, heart rate, and sleep monitoring, and provides insights into the daily habits of smart device users.
For my analysis, I used the following tables from the FitBit Fitness Tracker Data:
Before analyzing the data, I downloaded the three CSV files and imported them into Big Query for cleaning. During the data preparation process, I encountered an error message related to the timestamp format in the Sleepday - sleepDay_merged table. The error indicated that Big Query was unable to parse the timestamp ‘4/12/2016 12:00:00 AM’ as a valid TIMESTAMP value due to the unrecognized time zone ‘AM’. To fix this error, I modified the timestamp format in the CSV file to a format that is recognized by Big Query.
To clean the data, I used SQL within Big Query to modify each table individually. The cleaning process is shown below.
Checking if there are any nulls in any columns in the daily_activity table
Checking
if there are any nulls in any columns in the daily_activity table
Validating the match of survey dates 04.12.2016-05.12.2016 and the SedentaryActiveDistance column in the daily_activity table should not be below zero since Sedentary means inactive.
Counting
the highest and lowest calories burned and noticing that there is 0
calories as the lowest. Therefore finding how many particpants have only
0 calories burned.
3
partipants have 0 calories for the same days of activity.
Finding how many participants have a LoggedActivitiesDistance that is greater than zero since zero is typically registered
Checking if there are any nulls and duplicates any of the columns in the daily_intense table
Checking that there is different activity days per participant and validating the match of survey dates 04.12.2016-05.12.2016
Checking what is the most and least amount of minututes for the LightlyActiveMinutes and FairlyActiveMinutes column
Checking
and counting the Outliers in the FairlyActiveMinutes column
Creating another column that is called fair_validity that shows if the SedentaryActiveDistance is below 74 to make that an invalid number and counting the amount of invalid entries there are
Checking what is the most and least amount of minutes for the VeryActiveMinutes column and checking for outliers and the total count of outliers.
Creating another column called very_validity that shows if VeryActiveMinutes is below 120 to make that an invalid number
The
SedentaryActiveDistance column in the daily_intense table should not be
below zero since Sedentary means inactive. I counted the highest and
lowest SedentaryActiveDistance and creating another colum that called
SedentaryActiveDistance_validitity that shows if SedentaryActiveDistance
is below 0.01 and making that an invalid number.
Since there are inconsistencies in the decimal places for the columns LightActiveDistance, ModeratelyActiveDistance, and VeryActiveDistance I rounded all numbers by three decimal points.
Checking if there are any nulls and duplicates in any columns in the daily_sleep table. Also checking if the the dates match the survey dates of 04.12.2016-05.12.2016.
Checking what is the most and least amount of sleep records for the TotalSleepRecords and TotalMinutesAsleep column
Creating another colum that called totalsleep_validitity that shows if TotalMinutesAsleep is below 61 and greater than 775 making that an invalid number
Checking what is the most and least amount of assuming minutes in bed
for the TotalTimeInBed column. Also,checking the number of Outliers and
creating another colum that called TotalTimeInBed_validitity that shows
if TotalTimeInBed is below 61 and greater than 961 making that an
invalid number.
Checking if there are any nulls and duplicates in any columns in the daily_steps table. Also checking if the the dates match the survey dates of 04.12.2016-05.12.2016.
install.packages('tidyverse')
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.0
## ✔ readr 2.1.3 ✔ forcats 0.5.1 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
install.packages('dplyr')
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(dplyr)
install.packages('lubridate')
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
install.packages('janitor')
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ggplot2)
library(readr)
daily_activity <- read.csv("/cloud/project/bellabeat/dailyActivity_merged.csv")
daily_intense <- read.csv("/cloud/project/bellabeat/dailyIntensities_merged.csv")
daily_sleep <- read.csv("/cloud/project/bellabeat/sleepday - sleepDay_merged.csv")
daily_steps <- read.csv("/cloud/project/bellabeat/dailySteps_merged.csv")
<
head(daily_activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(daily_intense)
## Id ActivityDay SedentaryMinutes LightlyActiveMinutes
## 1 1503960366 4/12/2016 728 328
## 2 1503960366 4/13/2016 776 217
## 3 1503960366 4/14/2016 1218 181
## 4 1503960366 4/15/2016 726 209
## 5 1503960366 4/16/2016 773 221
## 6 1503960366 4/17/2016 539 164
## FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance
## 1 13 25 0
## 2 19 21 0
## 3 11 30 0
## 4 34 29 0
## 5 10 36 0
## 6 20 38 0
## LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
## 1 6.06 0.55 1.88
## 2 4.71 0.69 1.57
## 3 3.91 0.40 2.44
## 4 2.83 1.26 2.14
## 5 5.04 0.41 2.71
## 6 2.51 0.78 3.19
head(daily_sleep)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1503960366 4/12/2016 1 327 346
## 2 1503960366 4/13/2016 2 384 407
## 3 1503960366 4/15/2016 1 412 442
## 4 1503960366 4/16/2016 2 340 367
## 5 1503960366 4/17/2016 1 700 712
## 6 1503960366 4/19/2016 1 304 320
head(daily_steps)
## Id ActivityDay StepTotal
## 1 1503960366 4/12/2016 13162
## 2 1503960366 4/13/2016 10735
## 3 1503960366 4/14/2016 10460
## 4 1503960366 4/15/2016 9762
## 5 1503960366 4/16/2016 12669
## 6 1503960366 4/17/2016 9705
I am joining these two data frames together to analyze how these two are correlated to each other. Through this join I will be answering different questions.
sleep_and_intensity <- daily_intense %>%
full_join(daily_sleep, by="Id")
The maxium amount of very active minutes per particpant
max_very_act_mins <- aggregate(VeryActiveMinutes ~ Id, sleep_and_intensity, max)
print(max_very_act_mins)
## Id VeryActiveMinutes
## 1 1503960366 78
## 2 1624580081 186
## 3 1644430081 51
## 4 1844505072 2
## 5 1927972279 16
## 6 2022484408 104
## 7 2026352035 3
## 8 2320127002 20
## 9 2347167796 66
## 10 2873212765 75
## 11 3372868164 24
## 12 3977333714 50
## 13 4020332650 65
## 14 4057192912 3
## 15 4319703577 27
## 16 4388161847 120
## 17 4445114986 34
## 18 4558609924 66
## 19 4702921684 22
## 20 5553957443 69
## 21 5577150313 210
## 22 6117666160 26
## 23 6290855005 33
## 24 6775888955 70
## 25 6962181067 62
## 26 7007744171 64
## 27 7086361926 87
## 28 8053475328 132
## 29 8253242879 49
## 30 8378563200 137
## 31 8583815059 77
## 32 8792009665 10
## 33 8877689391 124
Average amount of sleep per participant according to their maxium VeryActiveMinutes
averages1 <- sleep_and_intensity %>%
group_by(Id) %>%
filter(VeryActiveMinutes == max(VeryActiveMinutes)) %>%
summarize(avg_sleep_mins = mean(TotalMinutesAsleep))
print(averages1)
## # A tibble: 33 × 2
## Id avg_sleep_mins
## <dbl> <dbl>
## 1 1503960366 360.
## 2 1624580081 NA
## 3 1644430081 294
## 4 1844505072 652
## 5 1927972279 417
## 6 2022484408 NA
## 7 2026352035 506.
## 8 2320127002 61
## 9 2347167796 447.
## 10 2873212765 NA
## # … with 23 more rows
Graphing the relationship between max VeryActiveMinutes and TotalMinutesAsleep
ggplot(averages1, aes(x = Id, y = avg_sleep_mins)) +
geom_point(color = "blue") +
geom_smooth(color = "red") +
geom_text(aes(label = Id), size = 3, hjust = 0, vjust = 0) +
xlab("Id") + ylab("Average Sleep Duration (minutes)") +
ggtitle("Average Sleep Duration by participant according to their max VeryActiveMinutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 9 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 9 rows containing missing values (`geom_point()`).
## Warning: Removed 9 rows containing missing values (`geom_text()`).
According to the graph the less minutes someone is active, the more minutes they tend to sleep.
I am joining these two data frames together to analyze how these two are correlated to each other. Through this join I will be answering different questions.
activity_and_steps <- daily_activity %>%
full_join(daily_steps, by="Id")
Showing the correlation between Total steps and Calories burned per participant from the new joined data frame
steps_calories_by_id <- activity_and_steps %>%
group_by(Id) %>%
summarize(total_steps = sum(TotalSteps), calories = sum(Calories))
print(steps_calories_by_id)
## # A tibble: 33 × 3
## Id total_steps calories
## <dbl> <int> <int>
## 1 1503960366 11644189 1745579
## 2 1624580081 5519891 1425504
## 3 1644430081 6554670 2530170
## 4 1844505072 2479442 1512118
## 5 1927972279 880400 2088067
## 6 2022484408 10927190 2412079
## 7 2026352035 5349763 1480560
## 8 2320127002 4532913 1656919
## 9 2347167796 3084372 662076
## 10 2873212765 7261099 1842206
## # … with 23 more rows
Graphing the correlation between Calories and Totalsteps burned per participant from the new joined.
ggplot(activity_and_steps, aes(x = Calories )) +
geom_bar() +
xlab("Totalcals") + ylab("Total steps") +
ggtitle("Total Calories Burned by participants")
Calculating the Average Total Steps per participant from the new joined data frame
average_totalsteps <- activity_and_steps %>%
group_by(Id) %>%
summarize(mean_steps = mean(TotalSteps))
print(average_totalsteps)
## # A tibble: 33 × 2
## Id mean_steps
## <dbl> <dbl>
## 1 1503960366 12117.
## 2 1624580081 5744.
## 3 1644430081 7283.
## 4 1844505072 2580.
## 5 1927972279 916.
## 6 2022484408 11371.
## 7 2026352035 5567.
## 8 2320127002 4717.
## 9 2347167796 9520.
## 10 2873212765 7556.
## # … with 23 more rows
Calculating the Average calories per participant from the new joined data frame
average_calories <- activity_and_steps %>%
group_by(Id) %>%
summarize(mean_calories = mean(Calories))
print(average_calories)
## # A tibble: 33 × 2
## Id mean_calories
## <dbl> <dbl>
## 1 1503960366 1816.
## 2 1624580081 1483.
## 3 1644430081 2811.
## 4 1844505072 1573.
## 5 1927972279 2173.
## 6 2022484408 2510.
## 7 2026352035 1541.
## 8 2320127002 1724.
## 9 2347167796 2043.
## 10 2873212765 1917.
## # … with 23 more rows
Calculating how many calories were burned by each participant correlating it their average of total steps
average_cal_steps <- activity_and_steps %>%
group_by(Id) %>%
summarize(mean_steps = mean(TotalSteps),
mean_calories = mean(Calories))
print(average_cal_steps)
## # A tibble: 33 × 3
## Id mean_steps mean_calories
## <dbl> <dbl> <dbl>
## 1 1503960366 12117. 1816.
## 2 1624580081 5744. 1483.
## 3 1644430081 7283. 2811.
## 4 1844505072 2580. 1573.
## 5 1927972279 916. 2173.
## 6 2022484408 11371. 2510.
## 7 2026352035 5567. 1541.
## 8 2320127002 4717. 1724.
## 9 2347167796 9520. 2043.
## 10 2873212765 7556. 1917.
## # … with 23 more rows
install.packages("ggplot2")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(ggplot2)
ggplot(average_cal_steps, aes(x = mean_steps, y = mean_calories)) +
geom_point() +
geom_smooth() +
xlab("Mean Steps") + ylab("Mean Calories Burned") +
ggtitle("Mean Steps vs. Mean Calories Burned")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
I am merging these two data frames together to analyze how these two are correlated to each other. Through this join I will be answering different questions.
df_merged <- merge(daily_sleep, daily_activity, by = "Id")
Graphing the correlation between the TotalMinutesAsleep and VeryActiveDistanc
install.packages("ggplot2")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(ggplot2)
ggplot(df_merged, aes(x = TotalMinutesAsleep, y = VeryActiveDistance)) +
geom_point()+
ggtitle("Relationship Between Total Minutes Asleep and Very Active Distance")
Graphing the correlation between VeryActiveMinutes and VeryActiveDistance
ggplot(data = daily_activity) +
geom_bar(mapping = aes(x = VeryActiveMinutes, y = VeryActiveDistance),
stat = "identity", fill = "#ae7855") +
labs(title = "Very Active Distance compared to VeryActiveMinutes",
subtitle = "Does more a more active state = a farther distance",
x = "minutes",
y = "distance") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Based on the analysis of Fitbit data, it was found that those who wore their tracker during very active moments tended to have lower average sleep times. This goes against the popular belief that increased physical activity leads to increased tiredness. In fact, participants who were less active actually slept more. Additionally, it was found that those who took more steps in a day burned more calories compared to those who took fewer steps. Participants who traveled greater distances while being active also tended to spend less time sleeping. These insights could potentially be applied to the Bellabeat Leaf. By understanding the relationship between physical activity and sleep, the Leaf could potentially help users to optimize their daily activity levels and improve their overall well-being.
Some potential insights that Bellabeat could consider to apply to the leaf are: